For the past 20 years, GASTech has been operating a natural gas production site in the island country of Kronos. The business has been profitable, and the company has developed close relationships with the Kronos government. In January 2014, following GASTech’s initial public offering, several GASTech employees went missing. An organization known as the Protectors of Kronos (POK) is suspected in these disappearances, as GASTech’s business practices have not been environmentally friendly. A thorough investigation is to be carried out by the law enforcement agencies of Kronos and Tethys to break this case.
Information on the whereabouts of company cars and on purchases made by employees in local stores has been provided to the investigators. We shall use visual analytics to make sense of this data and facilitate the investigation.
This work forms a sub-component that will eventually feed into an interactive Shiny app for the investigators, together with other sub-components covered by my project group mates. The objective of this assignment is to explore what insights can be drawn from the depths of this dataset to aid the investigation.
Crime analysis is a law enforcement function that involves the systematic identification and analysis of patterns and trends in crime and disorder [Wikipedia]. Too little data would inevitably limit the efficiency of an investigation, but an overwhelming volume of information poses a huge challenge as well. Coupled with the need for rapid analysis, having too much information to absorb, categorize, remember and draw meaning from can compromise the overall investigation [Crime-Analysis ScienceDirect]. Visual analytics techniques can be employed to gain useful insight from massive volumes of raw data.
For efficiency in data processing, information must first be consistent, as subtle differences can greatly increase variability and reduce the reliability and value of a dataset [Crime-Analysis ScienceDirect]. Next, there should not be information overload within a diagram. Good practices such as brushing and linking, selecting and marking, aggregation, elimination, and virtual navigation techniques such as zooming, focus + context, and details-on-demand have been studied and used to overcome an over-cluttered screen [Ku et al., 2016]. Uninteresting and expected patterns can also be unmarked to improve efficiency and reduce false positives [Arxiv].
According to [Arxiv, Visual of Anomalous User Behavior], detecting anomalous user behavior can be challenging, as the boundary between abnormal and normal data may not be clearly defined, and approaches like machine learning lack the contextual information needed to support decision-making. Visualization techniques such as sequence, graph, text, geographic and chart visualization can be combined with interaction methods like tracking and monitoring, pattern discovery, and exploration and navigation to analyze anomalous user behaviors.
Analysis of anomalous travel behaviors can take the following approaches:
Analysis of anomalous transactions can take the following approaches:
According to Robert Krueger [Year], movement data is more complex to handle than simple point-based data as it contains complex hierarchical structures of overlapping trajectories with diverse shapes and directions.
Movements can be spatially aggregated. A full spatial and temporal aggregation of the trajectories can result in a static graph G = (V,E) consisting of nodes V and edges E. Each edge e = (u,v) can encode directions and contain a weight that holds the travel volumes between the nodes. Analysis is flexible with this graph network and techniques such as clustering, segmentation, aggregation can be performed.
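As a rough illustration, this full spatio-temporal aggregation into a weighted edge list can be sketched with dplyr; the toy stops data frame and node names below are made up for illustration:

```r
library(dplyr)

# Toy trajectory data: each row is a visited node, ordered in time per vehicle
stops <- tibble::tibble(
  id   = c(1, 1, 1, 2, 2),          # vehicle identifier
  node = c("A", "B", "C", "A", "B") # visited location
)

# Derive directed edges (u, v) from consecutive stops within each trajectory;
# the weight of each edge holds the travel volume between the two nodes
edges <- stops %>%
  group_by(id) %>%
  mutate(to = lead(node)) %>%
  ungroup() %>%
  filter(!is.na(to)) %>%
  count(from = node, to, name = "weight")

edges  # A -> B appears in both trajectories, so its weight is 2
```

Clustering, segmentation or other graph techniques can then be run on this edge list or a graph object built from it.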

This is the world-famous cholera map produced by Dr. John Snow in 1854. Each bar plotted onto the map represents a death. It was immediately apparent where the deaths clustered, enabling the investigation to be directed and focused. It eventually led to the culprit water pump on Broad Street, which had been polluted by sewage tainted by a disposed baby nappy carrying cholera. Death statistics on their own might not have led to this discovery so soon had they not been geo-localized. Such spatial autocorrelation is powerful.
Visualizations can include many types, such as thematic maps, scatter plots, parallel coordinate plots, timelines and a wide range of other techniques. Interactivity that allows quick switching between these views can facilitate more insights. Such systems are generally described as coordinated-view visualizations. An example is shown below: global population trends by country are compared using a parallel coordinate plot, choropleth map and treemap. (Image courtesy of the National Center for Visual Analytics at Linköping University)

Space-time cubes show change over time within geographic space: each cube represents a slice of time, with the topmost cube holding the newest timestamp, so temporal changes in that geographic area can be visualized. There can be perspective distortion and occlusion, but the spatio-temporal distribution is highlighted [R. Krueger]. The map below shows a space-time cube web scene in ArcGIS Online (AGOL).

The following R packages are used:

- crosstalk - to create links between objects for interactivity
- raster and sf - to handle geospatial data
- tmap - for map visualization
- clock - to handle time values (a newer R package compared to lubridate), e.g. to suss out the day of week/day of year for our MC2 question
- lubridate - to parse and manipulate dates and times
- mapview - to visualize maps
The following information is available.
The type of data required would vary with the questions. This section would only cover the main data set up, EDA, quality of data, any general manipulations such as correcting the format of the data.
Special data manipulation specific to the questions would be covered in their respective sections instead.
We first load the datasets, via the read_csv() function.
car_assigned <- read_csv("data/car-assignments.csv")
gpstracking <- read_csv("data/gps.csv")
loyaltycard <- read_csv("data/loyalty_data.csv")
creditcard <- read_csv("data/cc_data.csv")
To check for missing values, we use the naniar package. The naniar package provides tidy ways to summarize, visualize and manipulate missing data.
#For car_assigned
miss_var_summary(car_assigned)
# A tibble: 5 x 3
variable n_miss pct_miss
<chr> <int> <dbl>
1 CarID 9 20.5
2 LastName 0 0
3 FirstName 0 0
4 CurrentEmploymentType 0 0
5 CurrentEmploymentTitle 0 0
#For creditcard
miss_var_summary(creditcard)
# A tibble: 4 x 3
variable n_miss pct_miss
<chr> <int> <dbl>
1 timestamp 0 0
2 location 0 0
3 price 0 0
4 last4ccnum 0 0
#For gpstracking
miss_var_summary(gpstracking)
# A tibble: 4 x 3
variable n_miss pct_miss
<chr> <int> <dbl>
1 Timestamp 0 0
2 id 0 0
3 lat 0 0
4 long 0 0
#For loyaltycard
miss_var_summary(loyaltycard)
# A tibble: 4 x 3
variable n_miss pct_miss
<chr> <int> <dbl>
1 timestamp 0 0
2 location 0 0
3 price 0 0
4 loyaltynum 0 0
Here, we observe that the only dataset with missing values is car_assigned. From the filter below, we see that the truck drivers are not assigned cars. For now, other than acknowledging this fact, we are indifferent to the missing values, as there seems to be no other dataset containing LastName and FirstName that could make use of the information below. Later, we will see that there is location tracking information on vehicles not in the car-assignments list; those vehicles could be associated with any of the truck drivers below.
For now, these rows can also be removed since we will be doing a join with the other datasets and would require a uniquely valid column without missing values. They will be removed via the complete.cases() function.
# A tibble: 9 x 5
LastName FirstName CarID CurrentEmploymentTy~ CurrentEmploymentTit~
<chr> <chr> <dbl> <chr> <chr>
1 Hafon Albina NA Facilities Truck Driver
2 Hawelon Benito NA Facilities Truck Driver
3 Hawelon Claudio NA Facilities Truck Driver
4 Mies Henk NA Facilities Truck Driver
5 Morlun Valeria NA Facilities Truck Driver
6 Morlun Adan NA Facilities Truck Driver
7 Morluniau Cecilia NA Facilities Truck Driver
8 Nant Irene NA Facilities Truck Driver
9 Scozzese Dylan NA Facilities Truck Driver
car_assigned_only <- car_assigned[complete.cases(car_assigned),]
car_assigned_only
# A tibble: 35 x 5
LastName FirstName CarID CurrentEmploymentT~ CurrentEmploymentT~
<chr> <chr> <dbl> <chr> <chr>
1 Calixto Nils 1 Information Techno~ IT Helpdesk
2 Azada Lars 2 Engineering Engineer
3 Balas Felix 3 Engineering Engineer
4 Barranco Ingrid 4 Executive SVP/CFO
5 Baza Isak 5 Information Techno~ IT Technician
6 Bergen Linnea 6 Information Techno~ IT Group Manager
7 Orilla Elsa 7 Engineering Drill Technician
8 Alcazar Lucas 8 Information Techno~ IT Technician
9 Cazar Gustav 9 Engineering Drill Technician
10 Campo-Corr~ Ada 10 Executive SVP/CIO
# ... with 25 more rows
Next, we shall ensure that the timestamps are in a consistent format, using the lubridate package. For the loyaltycard dataset, the timestamp is only in mdy format. We will observe that the timestamp is converted to POSIXct (for the creditcard and gpstracking datasets) and to Date (for loyaltycard, as it only has date data).
#For creditcard dataset
creditcard$TimeStampFormatted <-mdy_hm(creditcard$timestamp)
str(creditcard)
spec_tbl_df [1,490 x 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ timestamp : chr [1:1490] "01/06/2014 07:28" "01/06/2014 07:34" "01/06/2014 07:35" "01/06/2014 07:36" ...
$ location : chr [1:1490] "Brew've Been Served" "Hallowed Grounds" "Brew've Been Served" "Hallowed Grounds" ...
$ price : num [1:1490] 11.34 52.22 8.33 16.72 4.24 ...
$ last4ccnum : num [1:1490] 4795 7108 6816 9617 7384 ...
$ TimeStampFormatted: POSIXct[1:1490], format: "2014-01-06 07:28:00" ...
- attr(*, "spec")=
.. cols(
.. timestamp = col_character(),
.. location = col_character(),
.. price = col_double(),
.. last4ccnum = col_double()
.. )
#Delete timestamp column
creditcard <- creditcard %>%
dplyr::select(-timestamp)
#Reorder columns
col_order <- c("TimeStampFormatted", "location","price","last4ccnum")
creditcard <- creditcard[, col_order]
#For gpstracking dataset
gpstracking$TimeStampFormatted <-mdy_hms(gpstracking$Timestamp)
str(gpstracking)
spec_tbl_df [685,169 x 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ Timestamp : chr [1:685169] "01/06/2014 06:28:01" "01/06/2014 06:28:01" "01/06/2014 06:28:03" "01/06/2014 06:28:05" ...
$ id : num [1:685169] 35 35 35 35 35 35 35 35 35 35 ...
$ lat : num [1:685169] 36.1 36.1 36.1 36.1 36.1 ...
$ long : num [1:685169] 24.9 24.9 24.9 24.9 24.9 ...
$ TimeStampFormatted: POSIXct[1:685169], format: "2014-01-06 06:28:01" ...
- attr(*, "spec")=
.. cols(
.. Timestamp = col_character(),
.. id = col_double(),
.. lat = col_double(),
.. long = col_double()
.. )
#Delete timestamp column
gpstracking <- gpstracking %>%
dplyr::select(-Timestamp)
#Reorder columns
col_order <- c("TimeStampFormatted", "id","lat","long")
gpstracking <- gpstracking[, col_order]
#For loyaltycard dataset
loyaltycard$TimeStampFormatted <- mdy(loyaltycard$timestamp)
str(loyaltycard)
spec_tbl_df [1,392 x 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ timestamp : chr [1:1392] "01/06/2014" "01/06/2014" "01/06/2014" "01/06/2014" ...
$ location : chr [1:1392] "Brew've Been Served" "Brew've Been Served" "Hallowed Grounds" "Coffee Shack" ...
$ price : num [1:1392] 4.17 9.6 16.53 11.51 12.93 ...
$ loyaltynum : chr [1:1392] "L2247" "L9406" "L8328" "L6417" ...
$ TimeStampFormatted: Date[1:1392], format: "2014-01-06" ...
- attr(*, "spec")=
.. cols(
.. timestamp = col_character(),
.. location = col_character(),
.. price = col_double(),
.. loyaltynum = col_character()
.. )
We also observe special unidentifiable characters in the Katerina’s Cafe entries of the creditcard and loyaltycard datasets. These shall be identified and replaced using the str_replace_all() function to prevent errors in data processing.
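A minimal sketch of such a replacement is below; the exact stray characters in the raw files may differ, so the mis-encoded bytes and the pattern here are assumptions for illustration:

```r
library(stringr)

# Hypothetical mis-encoded values; the actual stray characters may differ
locs <- c("Katerina\u0092s Caf\u00e9", "Hallowed Grounds")

# Match the stray characters with "." and normalize the name
locs_clean <- str_replace_all(locs, "Katerina.s Caf.", "Katerina's Cafe")
locs_clean
```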
Next, we add to the gpstracking dataset the first and last names of the personnel each car is assigned to. We essentially want a left join of the gpstracking dataset with the car_assigned dataset, by the car ID. We can do this with the left_join() function.
gpsname <- left_join(gpstracking, car_assigned_only, by = c("id" = "CarID"))
#Join First Name and Last Name
gpsname$name <- paste(gpsname$FirstName, gpsname$LastName)
#Reorder cols
col_order <- c("TimeStampFormatted", "id","lat","long","name","LastName","FirstName","CurrentEmploymentType","CurrentEmploymentTitle")
gpsname <- gpsname[, col_order]
Next, we shall prepare the geospatial map for viewing. We will use the raster package to import the raster file for the map of Abila.
The file to be imported has already been georeferenced using QGIS into .tif format. Note that the raster layer is a three-band false-colour image, so we use tm_rgb() instead of tm_raster() to display all three bands; otherwise, the layer would come out in monochrome.
#Importing the Raster file
bgmap <- raster("data/MC2/MC2-tourist_modified.tif")
bgmap
class : RasterLayer
band : 1 (of 3 bands)
dimensions : 1606, 2708, 4349048 (nrow, ncol, ncell)
resolution : 3.155964e-05, 3.155964e-05 (x, y)
extent : 24.82426, 24.90973, 36.04486, 36.09555 (xmin, xmax, ymin, ymax)
crs : +proj=longlat +datum=WGS84 +no_defs
source : MC2-tourist_modified.tif
names : MC2.tourist_modified
values : 0, 255 (min, max)
#Plotting the Raster Layer.
tmap_mode("plot")
tm_shape(bgmap) +
tm_rgb(bgmap, r = 1, g = 2, b = 3,
alpha = NA,
saturation = 1,
interpolate = TRUE,
max.value = 255)

Next, we import the vector data, which comes in ESRI shapefile format. We shall import this shapefile using st_read() of the sf package. This shapefile is already coded in WGS 84.
Abila_st <- st_read(dsn = "data/MC2/Geospatial",
layer = "Abila")
Reading layer `Abila' from data source
`C:\peppermt\Visual Blogs\DvMoBlog\_posts\2021-06-26-assignment\data\MC2\Geospatial'
using driver `ESRI Shapefile'
Simple feature collection with 3290 features and 9 fields
Geometry type: LINESTRING
Dimension: XY
Bounding box: xmin: 24.82401 ymin: 36.04502 xmax: 24.90997 ymax: 36.09492
Geodetic CRS: WGS 84
Next, we map the aspatial data. The gps file has already been imported as gpstracking; we need to convert it into a simple feature data frame using st_as_sf() of the sf package. Coordinates will no longer be in dbl format but in geometry format, input as longitude (‘long’; x-coordinate) and latitude (‘lat’; y-coordinate) in EPSG: 4326, the WGS 84 geographic coordinate system.
gps_sf <- st_as_sf(gpstracking,
coords = c("long", "lat"),
crs = 4326)
We can view an interactive map with all the locations plotted as points via the mapview() function.
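A minimal sketch of such a call is below, using a couple of made-up points in place of the real gps_sf data:

```r
library(sf)
library(mapview)

# Two illustrative points (not the real GPS data)
pts <- st_as_sf(
  data.frame(long = c(24.85, 24.88), lat = c(36.06, 36.07), id = c(1, 2)),
  coords = c("long", "lat"), crs = 4326
)

mapview(pts)  # opens an interactive map widget with the points plotted
```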
Next, we create a movement path from the GPS points for each car, so we group the data by the car ID, the identifier. Since a dplyr pipeline needs a verb after group_by() for the grouping to take effect, we include a summarize() call (effectively a dummy, with do_union = FALSE) to overcome this issue.
All GPS points in each group are then joined into a line and output via st_cast().
The following date and time columns are created on gps_sf to enable easier filtering later.
#Including Date and Time for gps_sf
gps_sf$date <- as_date(gps_sf$TimeStampFormatted)
gps_sf$time <- hms::as_hms(gps_sf$TimeStampFormatted)
gps_sf$thirtymins <- hms::as_hms(round_date(gps_sf$TimeStampFormatted, "30 mins"))
#Creating Path with Group by Date
gps_path_date <- gps_sf %>%
group_by(id, date) %>%
summarize(m = mean(TimeStampFormatted),
do_union = FALSE) %>%
st_cast("LINESTRING")
#Creating Path with Group by Time
gps_path_time <- gps_sf %>%
group_by(id, thirtymins) %>%
summarize(m = mean(TimeStampFormatted),
do_union = FALSE) %>%
st_cast("LINESTRING")
#Creating Path with Group by Date Time
gps_path_datetime <- gps_sf %>%
group_by(id, date, thirtymins) %>%
summarize(m = mean(TimeStampFormatted),
do_union = FALSE) %>%
st_cast("LINESTRING")
We then write a selection snippet to filter the GPS paths for a selected car ID and visualize them on the map. We can pick the appropriate dataset (i.e. gps_path_time to filter by time, gps_path_datetime to filter by datetime).
gps_path_selected <- gps_path_date %>%
filter(id == 101, date == "2014-01-09")
tmap_mode("view")
tm_shape(bgmap) +
tm_rgb(bgmap, r = 1, g = 2, b = 3,
alpha = NA,
saturation = 1,
interpolate = TRUE,
max.value = 255) +
tm_shape(gps_path_selected) +
tm_lines()
Using just the credit and loyalty card data, identify the most popular locations, and when they are popular. What anomalies do you see? What corrections would you recommend to correct these anomalies?
From the bar plot above, by credit card transactions, Katerina’s Cafe, Hippokampos, Guy’s Gyros and Brew’ve Been Served are the most popular locations. We shall see the outcome when the loyalty card transactions are analyzed in the same manner. From the bar plot below, the same results are derived for the Top 4 most popular locations.
However, the bar plots only tell us the total number of visits over the two weeks. We do not know which days had more visits, or the profile of visits across the day. To view this, we can use violin plots: the thickness of the plot reflects the frequency of visits at that point in time, and we can also see on which days the visits took place.
We shall use the violin plots in the plotly package as they are interactive: we can hover over the plots, especially where they are thicker, to see which time periods those are.
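A minimal sketch of such a plotly violin plot is below; the df data frame and its location/hour columns are made up for illustration:

```r
library(plotly)

# Made-up transaction hours for two locations
set.seed(42)
df <- data.frame(
  location = rep(c("Katerina's Cafe", "Hippokampos"), each = 50),
  hour = c(rnorm(50, mean = 13, sd = 1.5), rnorm(50, mean = 20, sd = 1))
)

# One violin per location; hovering reveals the underlying time values
fig <- plot_ly(df, x = ~location, y = ~hour, type = "violin",
               box = list(visible = TRUE))
fig
```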
From the violin plot above, we can see that the top four frequented locations identified above (Katerina’s Cafe, Hippokampos, Guy’s Gyros, Brew’ve Been Served) have a fairly consistent daily visit rate throughout the two weeks from 6th to 20th Jan: the thickness of the violin plot is consistent, meaning they are popular throughout the two weeks.
However, other locations can have higher visit rates at specific periods than the identified top four. For example, the visit rate for Desaflo Golf Course was much higher specifically on 19th Jan, as it was for Kronos Pipe and Irrigation on 7th Jan. Hallowed Grounds also had peak visit rates on 8th and 15th Jan.
The violin plot gave a good overview showing that visiting days vary across locations. Now we shall dissect this further into day and time.
We shall use geom_tile() from ggplot2 to visualize this, wrapped with ggplotly() for interactivity, so that we can hover over each square and see the corresponding location and date/time. One tile plot will show “Visits at Each Location for Each Day” and another “Visits at Each Location for Time of Day”.
We first prepare the dataset for this plot by extracting the date and time data into separate columns, then counting each. As the timestamp was formatted into POSIXct earlier, R already stores it as dttm (date-time); we can simply use as_date() from lubridate and as_hms() from the hms package to extract them. We retain them in Date and time formats for easier manipulation later.
Next, we would count the number of visits to each location for each Day and time period. While the Date variable is more discrete and easier to count, more work needs to be done to count for the Time variable as it would not make sense to count how many visits are there when the Time value is detailed to the second. Hence, we will round the timestamp down to the nearest half hour for this count, via the round_date() function.
#Count by Day
cardbyday <- creditcard %>%
group_by(date) %>%
add_count(date, location) %>%
distinct(date, location, n)
loyalcardbyday <- loyaltycard %>%
group_by(TimeStampFormatted) %>%
add_count(TimeStampFormatted, location) %>%
distinct(TimeStampFormatted, location, n)
#Count by Time
#Round the time to the nearest half hour
cardbytime <- creditcard %>%
group_by(thirtymins = hms::as_hms(round_date(TimeStampFormatted, "30 mins"))) %>%
add_count(thirtymins, location) %>%
distinct(thirtymins, location, n)
With the data prepared, we shall do the tile plot via geom_tile() from ggplot, then wrapped by ggplotly() for interactivity.
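The tile plots are built along the following lines; the small data frame here stands in for the cardbyday counts prepared above:

```r
library(ggplot2)
library(plotly)

# Illustrative counts standing in for cardbyday
toy <- data.frame(
  date = rep(as.Date("2014-01-06") + 0:2, times = 2),
  location = rep(c("Katerina's Cafe", "Hippokampos"), each = 3),
  n = c(10, 12, 9, 7, 8, 11)
)

p <- ggplot(toy, aes(x = date, y = location, fill = n)) +
  geom_tile() +
  labs(title = "Visits at Each Location for Each Day",
       x = "Date", y = "Location")

ggplotly(p)  # wrap for hover interactivity
```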
This would be for the “Visits at each Location for Each Day (Credit Card)”.
This is the same plot plotted via loyalty card data.
Next, this would be for “Visits at Each Location for Time of Day”, which can only come from the credit card transactions dataset as transaction time information is not present in loyalty card dataset.
From the plots above, we could see that Katerina’s Cafe, Hippokampos, Guy’s Gyros and Brew’ve Been Served have the lightest bars across the 2 weeks. This shows that they are consistently popular. We also note that Guy’s Gyros is more popular during weekdays than weekends, and Brew’ve Been Served only has visits on weekdays, indicating that it is probably not open on weekends.
| No. | Location | Periods Popular | Remarks |
|---|---|---|---|
| 1 | Katerina’s Cafe | All Days ; Lunchtime (1300 hrs - 1430 hrs), Dinnertime (1900 hrs - 2130 hrs) | NIL |
| 2 | Hippokampos | All Days ; Lunchtime (1230 hrs - 1430 hrs), Dinnertime (1930 hrs - 2230 hrs) | NIL |
| 3 | Guy’s Gyros | More Popular During Weekdays ; Lunchtime (1300 hrs - 1430 hrs), Dinnertime (1900 hrs - 2130 hrs) | Unusually crowded on 19 Jan (Sun) compared to the previous Sunday and weekend crowd |
| 4 | Brew’ve Been Served | Weekday Mornings ; Morning Coffee (0730 hrs - 0830 hrs) | Huge Crowd at 0800 hrs ; Seems closed on weekends |
We note that the popular locations above are all identified on the map of Abila and are located near GASTech, except for Hippokampos, which is nowhere to be found on the map. This is an anomaly. To solve the mystery of where Hippokampos is, we will have to delve into other data, such as the vehicle tracking.
U Pump is a refuelling station, yet it is visited only twice in the credit card transactions, on two separate days - 6th and 13th Jan. This refuelling frequency is unusually low. It could be that employees do not need to pay for fuel with their own credit cards, but strangely, no loyalty cards were used either. The same goes for Frank’s Fuel, with transactions on only two days - 8th and 18th Jan. This is an abnormal situation which we hope to gain further insight into later.
Supermarkets should have daily transactions. However, according to the credit card data, Kronos Mart and General Grocer each went 3-5 days without any transactions within the two weeks. The maximum number of transactions in a single day is only 3 for Kronos Mart and 2 for General Grocer. Business seems awfully bad.
When we compare this with loyalty card transaction data, there are still days without any transactions, but we see the following differences:
| Date | Credit Card Data | Loyalty Card Data |
|---|---|---|
| 9 Jan | No transaction | 1 transaction |
| 10 Jan | 1 transaction | No transaction |
| 12 Jan | 1 transaction | 2 transactions |
| 13 Jan | 2 transactions | No transaction |
| 14 Jan | 1 transaction | No transaction |
| 15 Jan | No transaction | 1 transaction |
| 17 Jan | 1 transaction | No transaction |
| 18 Jan | No transaction | 3 transactions |
| 19 Jan | 3 transactions | No transaction |
This means that there are transactions paid in cash rather than by credit card (days with loyalty card records but no credit card records). But business still seems abnormally bad.
Add the vehicle data to your analysis of the credit and loyalty card data. How does your assessment of the anomalies in question 1 change based on this new data? What discrepancies between vehicle, credit, and loyalty card data do you find?
From the plot above, we see that most employees traveled daily, except for Sanjorge Jr. Sten, the CEO of GasTech, who was likely chauffeured rather than driving on his own. We also notice a high count of weekday travel with LastName = NA. Checking the gpsname data, we find GPS tracking information for vehicle IDs 105 and 107, which do not exist in the car_assigned dataset - this is why LastName was NA. The routes for vehicle IDs 105 and 107 are tracked below.


The routes show that these vehicles come out of GasTech, go around nearby areas, and then return to GasTech. Vehicle 105 goes to Carlyle Chemical Inc., Nationwide Refinery, and Guy’s Gyros; vehicle 107 also travels often to Maximum Iron and Steel. Since both vehicles move only on weekdays, they are likely the company trucks, which are not assigned to a specific employee. The personnel who can use these vehicles are tabulated below:
# A tibble: 9 x 5
LastName FirstName CarID CurrentEmploymentTy~ CurrentEmploymentTit~
<chr> <chr> <dbl> <chr> <chr>
1 Hafon Albina NA Facilities Truck Driver
2 Hawelon Benito NA Facilities Truck Driver
3 Hawelon Claudio NA Facilities Truck Driver
4 Mies Henk NA Facilities Truck Driver
5 Morlun Valeria NA Facilities Truck Driver
6 Morlun Adan NA Facilities Truck Driver
7 Morluniau Cecilia NA Facilities Truck Driver
8 Nant Irene NA Facilities Truck Driver
9 Scozzese Dylan NA Facilities Truck Driver
We also notice unusually high movement for the trucks on 16 Jan - 26,116 GPS points per day, compared to the usual 3,000-6,000 points per day on the other days within these two weeks. Further investigation will be covered in Question 5.
Next, we will use the function below to do a plot of travel time periods on each day, for each personnel.
We see that there are frequent visits to the supermarkets and refuelling stations. This means there were transactions and business, but the transactions were not paid by credit card, and loyalty cards were not used either.
Can you infer the owners of each credit card and loyalty card? What is your evidence? Where are there uncertainties in your method? Where are there uncertainties in the data?
#Prep data first by joining the loyalty card number into the credit card dataset
credit_loyalty <- left_join(creditcardB, loyaltycardB,
                            by = c("price" = "price",
                                   "date" = "TimeStampFormatted",
                                   "location" = "location"))
#To see if credit card and loyalty card pairs are unique
creditloyaltyuniq <- credit_loyalty %>%
  group_by(last4ccnum, loyaltynum) %>%
  summarize(Count = n())
creditloyaltyuniq
howmany <- loyaltycard %>% distinct(loyaltynum)
howmany
figD <- creditloyaltyuniq %>%
  ggplot(aes(x = last4ccnum, y = loyaltynum)) +
  geom_tile(aes(fill = Count)) +
  scale_fill_distiller(palette = "YlGnBu", direction = -1) +
  labs(title = "No. of Purchases on Each Credit Card and Loyalty Card Pair",
       x = "Credit Card No.", y = "Loyalty Card No.")
ggplotly(figD)
These are the anomalies:
One loyalty card associated with multiple credit cards:
L9406 associated with 4948 (22) and 5921 (1).
L8566 associated with 4795 (25) and 8332 (1).
L6119 associated with 7889 (20) and 5368 (1).
L3295 associated with 4948 (1) and 5921 (12).
L3288 associated with 1286 (15) and 9241 (13).
L2247 associated with 5368 (24) and 7889 (1).
L2070 associated with 4795 (1) and 8332 (27).
One credit card associated with multiple loyalty cards:
1286 associated with L3288 (15) and L3572 (13)
4795 associated with L2070 (1) and L8566 (25)
4948 associated with L3295 (1) and L9406 (22)
5368 associated with L2247 (24) and L6119 (1)
5921 associated with L3295 (12) and L9406 (1) [Same pattern as 4948! Opposite!]
7889 associated with L2247 (1) and L6119 (20) [Same pattern with 5368! Opposite!]
8332 associated with L2070 (27) and L8566 (1) [Same pattern as 4795!]
fig <- plot_ly(
type = "sankey",
orientation = "h",
node = list(
label = c("A1", "A2", "B1", "B2", "C1", "C2"),
color = c("blue", "blue", "blue", "blue", "blue", "blue"),
pad = 15,
thickness = 20,
line = list(
color = "black",
width = 0.5
)
),
link = list(
source = c(0,1,0,2,3,3),
target = c(2,3,3,4,4,5),
value = c(8,4,2,8,4,2)
)
)
fig <- fig %>% layout(
title = "Basic Sankey Diagram",
font = list(
size = 10
)
)
Given the data sources provided, identify potential informal or unofficial relationships among GASTech personnel. Provide evidence for these relationships.
Do you see evidence of suspicious activity? Identify 1-10 locations where you believe the suspicious activity is occurring, and why.
Unusually high movement activity for truck on 16 Jan 2014.
One visit to Daily Dealz on 13 Jan 2014, at 6 am.
Sanjorge Jr. only had 3 days of GPS records, from 17th to 19th Jan, while everyone else travelled almost daily.
Stolen credit card?
L3288 associated with 1286 (15) and 9241 (13); 1286 associated with L3288 (15) and L3572 (13).
5921 associated with L3295 (12) and L9406 (1) [same pattern as 4948, opposite!]; 4948 associated with L3295 (1) and L9406 (22).
7889 associated with L2247 (1) and L6119 (20) [same pattern as 5368, opposite!]; 5368 associated with L2247 (24) and L6119 (1).
8332 associated with L2070 (27) and L8566 (1) [same pattern as 4795!]; 4795 associated with L2070 (1) and L8566 (25).
Guy’s Gyros is unusually crowded on 19 Jan (Sun) compared to the previous Sunday and weekend crowd.
Supermarket transactions in the wee hours, with high transaction amounts
The above plot also shows that there were 5 transactions made at Kronos Mart in the wee hours between 0300 and 0400 hrs, as shown in the table below. We also observe that the transaction amounts are unusually high for a middle-of-the-night purchase.
kronos <- creditcard %>% filter(location == "Kronos Mart") %>%
arrange(time) %>% slice_min(time,n = 6)
kronos
# A tibble: 6 x 6
TimeStampFormatted location price last4ccnum date time
<dttm> <chr> <dbl> <dbl> <date> <time>
1 2014-01-13 03:00:00 Kronos Mart 147. 5407 2014-01-13 03:00
2 2014-01-19 03:13:00 Kronos Mart 87.7 3484 2014-01-19 03:13
3 2014-01-12 03:39:00 Kronos Mart 277. 8156 2014-01-12 03:39
4 2014-01-19 03:45:00 Kronos Mart 195. 9551 2014-01-19 03:45
5 2014-01-19 03:48:00 Kronos Mart 150. 8332 2014-01-19 03:48
6 2014-01-16 07:30:00 Kronos Mart 299. 7108 2014-01-16 07:30